Species Distribution Modelling
Russell Dinnage
Data Processing Steps
Model Building
Data feeds into Models
Models inform Data processing steps
library(sdmpack)
library(tidyverse)
library(tidymodels)
RF_abund has data on the abundance of different reef species at different reefs around the world.
data("RF_abund")
RF_abund

fish_dat <- RF_abund %>% filter(SpeciesName == "Thalassoma pavo")
lm() is the basic linear model function in R.
mod <- lm(AbundanceAdult40 ~ MeanTemp_CoralWatch, data = fish_dat)
Formulas in R take the form lhs ~ rhs, where lhs stands for Left Hand Side, and rhs stands for Right Hand Side.
lhs contains the ‘response’ variables that we wish to model as a function of the variables on the rhs: the ‘predictor’ variables.
lhs and rhs can contain multiple variables and function calls (we will see examples of that later on).
Formulas do not need a lhs, but they must always have a ~ and a rhs (~ rhs is a valid formula).
summary(mod)
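The formula rules above can be tried out directly in base R. The sketch below uses placeholder variable names (y, x, x1, x2) that are not columns of RF_abund; it only illustrates how formulas are built and inspected, without fitting anything.

```r
# Illustrative formulas; the variable names are placeholders, not columns of RF_abund
f_simple <- y ~ x              # one response, one predictor
f_multi  <- log(y) ~ x1 + x2   # function call on the lhs, two predictors on the rhs
f_rhs    <- ~ x                # one-sided formula: no lhs

# A formula is an ordinary R object of class "formula"
class(f_simple)

# all.vars() lists the variables a formula refers to
all.vars(f_multi)

# A two-sided formula has three parts (~, lhs, rhs); a one-sided one has two
length(f_simple)
length(f_rhs)
```

Because a formula is only a description, none of the variables need to exist until the formula is handed to a modelling function such as lm() together with a data frame.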
ggplot(fish_dat, aes(x = MeanTemp_CoralWatch, y = AbundanceAdult40)) + geom_smooth(method = lm, se = FALSE, color = "red") + geom_point() + theme_minimal()
It is much easier to tell whether the model is useful at all by visualizing it.
Distribution of Abundance:
hist(fish_dat$AbundanceAdult40, breaks = 100)

hist(log(fish_dat$AbundanceAdult40), breaks = 50)

However, the main problem seems to stem from the fact that a straight line is not a good description of the response. Therefore, most SDM methods in actual use allow for non-linear relationships. In this case (and in many others), some kind of hump-shaped function might make sense.
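One simple way to get a hump-shaped response is to add a squared term to the linear model. The sketch below is not the approach taken in the rest of these slides; it uses simulated data (the variables temp and abund and the peak near 22 degrees are assumptions made up for illustration) to show how a quadratic term enters an lm() formula.

```r
set.seed(1)

# Simulated data (an assumption for illustration): response peaks near 22 degrees
temp  <- runif(200, min = 15, max = 30)
abund <- 200 - 2 * (temp - 22)^2 + rnorm(200, sd = 10)
sim_dat <- data.frame(temp = temp, abund = abund)

# Straight-line model versus a model with an added squared term
mod_line <- lm(abund ~ temp, data = sim_dat)
mod_hump <- lm(abund ~ temp + I(temp^2), data = sim_dat)

# A negative coefficient on the squared term gives a hump-shaped curve
coef(mod_hump)

# In-sample variance explained by each model
summary(mod_line)$r.squared
summary(mod_hump)$r.squared
```

For the fish data, the corresponding formula would be AbundanceAdult40 ~ MeanTemp_CoralWatch + I(MeanTemp_CoralWatch^2). Whether that describes Thalassoma pavo well is not shown here and would need to be checked against the data.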

tidymodels
tidymodels is an R package for statistical and machine learning models.
Like tidyverse, it is a meta-package, bundling several other packages together.
parsnip: a tidymodels package to fit models.
parsnip separates the model ‘structure’ from the implementation.
mod_pars <- linear_reg(engine = "lm")
mod_pars
Models are fit with the fit() function. parsnip is designed to work with pipe operators (%>% or |>).
mod_pars <- linear_reg(engine = "lm") %>%
  fit(AbundanceAdult40 ~ MeanTemp_CoralWatch,
      data = fish_dat)
mod_pars
A tidy summary of the fitted model is available via the tidy() function.
mod_summ <- tidy(mod_pars)
mod_summ
# parsnip's predict() takes new_data, not newdata
preds <- predict(mod_pars, new_data = fish_dat)
preds
pred_dat <- fish_dat %>% bind_cols(preds)
p <- ggplot(pred_dat, aes(x = MeanTemp_CoralWatch, y = .pred)) + geom_line() + geom_point(aes(y = AbundanceAdult40)) + theme_minimal()
suppressMessages(print(p))
p <- ggplot(pred_dat, aes(x = MeanTemp_CoralWatch, y = .pred)) + geom_line() + geom_point(aes(y = AbundanceAdult40)) + scale_y_continuous(trans = "log1p") + theme_minimal()
suppressMessages(print(p))
With parsnip, it is easy to change our model to one that can capture nonlinear relationships. Let’s try a gradient boosted decision tree (you don’t need to know what that is for now).
mod_pars2 <- boost_tree(mode = "regression") %>%
  fit(AbundanceAdult40 ~ MeanTemp_CoralWatch,
      data = fish_dat)
mod_pars2
# parsnip's predict() takes new_data, not newdata
preds2 <- predict(mod_pars2, new_data = fish_dat)
pred_dat2 <- fish_dat %>% bind_cols(preds2)
p <- ggplot(pred_dat2, aes(x = MeanTemp_CoralWatch, y = .pred)) + geom_line() + geom_point(aes(y = AbundanceAdult40)) + scale_y_continuous(trans = "log1p") + theme_minimal()
suppressMessages(print(p))
An important concept in Data Science and Machine Learning is the idea of overfitting. The above model appears to ‘overfit’ the data: its predictions jump around wildly to try to fit each individual data point. Such a model is not likely to generalize well if applied to a new dataset. We reduce overfitting by tuning the ‘hyper-parameters’ of the algorithm to make it produce ‘smoother’ predictions, which are more likely to generalize to new datasets. The tuning is done using a method called cross-validation, which we will cover in depth in Week 7.
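The idea of a hyper-parameter that controls smoothness can be sketched without boosted trees. The example below swaps in a different technique, a base R smoothing spline, where the df argument plays the role of the hyper-parameter, and uses a single train/test split rather than full cross-validation. The simulated data and the particular df values are assumptions chosen for illustration only.

```r
set.seed(42)

# Simulated data (an assumption for illustration): a smooth hump plus noise
x <- runif(200, min = 15, max = 30)
y <- 200 - 2 * (x - 22)^2 + rnorm(200, sd = 15)

# Hold out a quarter of the points to stand in for a "new dataset"
test_idx <- sample(seq_along(x), size = 50)
x_train <- x[-test_idx]; y_train <- y[-test_idx]
x_test  <- x[test_idx];  y_test  <- y[test_idx]

# df is the hyper-parameter here: a larger df allows a wigglier curve
fit_wiggly <- smooth.spline(x_train, y_train, df = 40)
fit_smooth <- smooth.spline(x_train, y_train, df = 4)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Error on the training data versus error on the held-out data
rmse_train <- c(wiggly = rmse(y_train, predict(fit_wiggly, x_train)$y),
                smooth = rmse(y_train, predict(fit_smooth, x_train)$y))
rmse_test  <- c(wiggly = rmse(y_test, predict(fit_wiggly, x_test)$y),
                smooth = rmse(y_test, predict(fit_smooth, x_test)$y))
rmse_train
rmse_test
```

The wigglier fit will have the lower training error; the comparison worth looking at is the held-out error, which is what cross-validation estimates more carefully. In parsnip, boost_tree() exposes comparable knobs through arguments such as trees, tree_depth and learn_rate.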

Image citation:
Trotta, Lauren B., et al. “Community phylogeny of the globally critically imperiled pine rockland ecosystem.” American journal of botany 105.10 (2018): 1735-1747.
We will start by extracting data from this:

tidymodels